Homework 1: Introduction to Optimization and Least Squares
Instructions
Please submit a .qmd file along with a rendered pdf to the Brightspace page for this assignment. You may use whatever language you like within your qmd file; I recommend Python, Julia, or R.
Problem 1: Gradient Descent
- Consider the function \(f: \mathbb{R}^2\,\to\, \mathbb{R}\) defined by:
\[ f(x,y) = (x-1)^2 + (y+2)^2, \]
Find the single critical point of this function and show that it is a global minimum.
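As a reminder of the criteria involved here (stated generally, without giving away the computation): a critical point is a point where the gradient vanishes, and a positive definite Hessian at that point certifies a minimum.

```latex
\[
\nabla f(x,y) =
\begin{bmatrix} \partial f/\partial x \\[2pt] \partial f/\partial y \end{bmatrix}
= \mathbf{0}
\quad \text{(critical point)}, \qquad
\nabla^2 f \succ 0 \;\Longrightarrow\; \text{minimum}.
\]
```

For this particular \(f\), the Hessian is constant, which is what lets a local conclusion become a global one.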
- Suppose we wanted to find the global minimum of this function using gradient descent instead of the direct calculation from part (a). Write code to perform the gradient descent algorithm, that is perform the iteration: \[ \mathbf{v}_{n+1} = \mathbf{v}_n - \nu \nabla f(\mathbf{v}_n), \]
where the vector \(\mathbf{v} = [x\quad y]^T\) and \(\nu\) is the learning rate.
Then test the performance of your algorithm for the learning rates \(\nu = 1, 0.1, 0.01\), by determining the number of steps required for the algorithm to satisfy the condition \(\|\mathbf{v}_n-\mathbf{v}_{\mathrm{opt}}\| \leq 10^{-8}\). Start with an initial guess of \(\mathbf{v}_0 = [0\quad 0]^T\). Does the algorithm converge for all the values of the learning rate?
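A minimal sketch of the iteration in Python (the function names, tolerance handling, and step cap are illustrative choices, not required ones):

```python
import numpy as np

def grad_f(v):
    # Gradient of f(x, y) = (x - 1)^2 + (y + 2)^2
    return np.array([2.0 * (v[0] - 1.0), 2.0 * (v[1] + 2.0)])

def gradient_descent(nu, v0=(0.0, 0.0), tol=1e-8, max_steps=10_000):
    """Return the number of steps until ||v_n - v_opt|| <= tol, or None."""
    v_opt = np.array([1.0, -2.0])  # minimizer of f, from part (a)
    v = np.array(v0, dtype=float)
    for n in range(1, max_steps + 1):
        v = v - nu * grad_f(v)
        if np.linalg.norm(v - v_opt) <= tol:
            return n
    return None  # did not converge within max_steps

for nu in [1.0, 0.1, 0.01]:
    print(nu, gradient_descent(nu))
```

Note that for \(\nu = 1\) the iterate reflects across the minimizer at every step, so the error never shrinks; the loop's step cap is what terminates that case.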
- Now consider a modification to \(f\) which depends on a parameter \(b\):
\[ f(x,y) = (x-1)^2 + b(y+2)^2 \]
This function has its global minimum at the same location as the original \(f\). Make contour plots of the function \(f\) in the vicinity of its global minimum for \(b=1\), \(b=3\), and \(b=10\). Then use gradient descent to find the global minimum, starting from the same initial guess as in part (b), but restricting to the learning rate \(\nu=0.1\). Plot the trajectories of the points \(\mathbf{v}\) that gradient descent finds on top of the contour plots, and compare the number of steps needed for the error to fall below \(10^{-8}\) with the number needed in the \(b=1\) case that you studied in part (b).
The differences that you observe here are a special case of a more general phenomenon: the speed of convergence of gradient descent depends on something called the condition number of the Hessian matrix (the matrix of second-order partial derivatives) of the target function. The condition number of a symmetric positive definite matrix is just the ratio of the largest to smallest eigenvalues; in this case the condition number is \(\max(b, 1/b)\). Gradient descent performs worse and worse the larger the condition number (and large condition numbers are problematic for a wide variety of other numerical methods).
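The claim about this family of functions can be checked numerically; a small sketch (the helper name is illustrative):

```python
import numpy as np

def hessian_condition(b):
    # Hessian of f(x, y) = (x - 1)^2 + b(y + 2)^2 is constant: diag(2, 2b)
    H = np.diag([2.0, 2.0 * b])
    eigvals = np.linalg.eigvalsh(H)   # eigenvalues in ascending order
    return eigvals[-1] / eigvals[0]   # ratio of largest to smallest

for b in [1, 3, 10]:
    print(b, hessian_condition(b))   # condition numbers 1, 3, 10
```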
Problem 2: Solving Least Squares Problems
Generate a random \(20\times 10\) matrix \(A\) and a random 20-vector \(b\). Then, solve the least squares problem: \[ \min_{\mathbf{x}\in \mathbb{R}^{10}} \|A\mathbf{x} - \mathbf{b}\|^2 \] in the following ways:
Multiply \(\mathbf{b}\) by the Moore-Penrose pseudoinverse \(A^+\).
Use built-in functions to solve the least squares problem (i.e. in Python numpy.linalg.lstsq, in R lm, and in Julia the backslash operator).
Using the \(QR\) factorization of \(A\). This factorization rewrites \(A\) as: \[ A = QR, \] where \(Q\) is a \(20\times 10\) matrix with orthonormal columns and \(R\) is a \(10\times 10\) upper triangular matrix. The least squares solution equals: \[ \mathbf{x} = R^{-1}Q^T\mathbf{b} \]
Verify that each of these solutions is nearly equal and that the residual \(A\mathbf{x}-\mathbf{b}\) is orthogonal to the columns of \(A\).
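The three approaches can be sketched as follows in Python (the random seed is arbitrary; any seed gives the same qualitative result):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)

# (a) Moore-Penrose pseudoinverse
x_pinv = np.linalg.pinv(A) @ b

# (b) built-in least squares solver
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# (c) thin QR factorization: A = QR, so x = R^{-1} Q^T b
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# all three solutions agree to numerical precision
assert np.allclose(x_pinv, x_lstsq) and np.allclose(x_pinv, x_qr)

# the residual is orthogonal to every column of A
r = A @ x_pinv - b
print(np.linalg.norm(A.T @ r))  # essentially zero
```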
Problem 3: Iterative Solutions to Least Squares
Although the pseudoinverse provides an exact formula for the least squares solution, there are some situations in which using the exact solution is computationally difficult, particularly when the matrix \(A\) and vector \(\mathbf{b}\) have a large number of entries. In this case, forming the Gram matrix \(A^TA\), which is an \(n\times n\) matrix, may require an enormous amount of memory. In these cases it may be better to use an approximate solution instead of the exact formula. There are many different approximate methods for solving least squares problems; here we will use an iterative method developed by Richardson.
This method begins with an initial guess \(\mathbf{x}^{(0)} = 0\) and calculates successive approximations as follows:
\[ \mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \mu A^T\left(A\mathbf{x}^{(k)}-\mathbf{b}\right) \]
Here \(\mu\) is a positive parameter that has a similar interpretation to the learning rate for gradient descent. A choice that guarantees convergence is \(\mu \leq \frac{1}{\|A\|^2}\). The iteration is terminated when \(\|A^T(A\mathbf{x}^{(k)} - \mathbf{b})\|\) is below a user-determined threshold, which indicates that the least squares optimality conditions are nearly satisfied.
- Suppose that \(\mathbf{x}\) is a solution to the least squares problem: \[ \mathbf{x} = A^+\mathbf{b} \]
Show that \(\mathbf{x}\) is a fixed point of the iteration scheme, i.e. that: \[ \mathbf{x} = \mathbf{x} - \mu A^T\left(A\mathbf{x}-\mathbf{b}\right) \]
- Generate a random \(20 \times 10\) matrix \(A\) and 20-vector \(\mathbf{b}\), and compute the least squares solution \(\mathbf{x} = A^+\mathbf{b}\). Then run the Richardson algorithm with \(\mu = \frac{1}{\|A\|^2}\) for 500 iterations, and plot \(\|\mathbf{x}^{(k)}-\mathbf{x}\|\) versus \(k\) to verify that \(\mathbf{x}^{(k)}\) is converging to \(\mathbf{x}\).
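The Richardson iteration itself is only a few lines; a sketch without the plotting step (the seed and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)

x_star = np.linalg.pinv(A) @ b           # exact least squares solution
mu = 1.0 / np.linalg.norm(A, 2) ** 2     # step size mu = 1 / ||A||^2

x = np.zeros(10)                          # initial guess x^(0) = 0
errors = []
for _ in range(500):
    x = x - mu * A.T @ (A @ x - b)        # Richardson update
    errors.append(np.linalg.norm(x - x_star))

print(errors[0], errors[-1])  # the error shrinks toward zero
```

Plotting `errors` on a log scale (e.g. with `matplotlib.pyplot.semilogy`) makes the geometric rate of convergence visible as a straight line.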